73 research outputs found

    A hierarchical Bayesian network approach for linkage disequilibrium modeling and data-dimensionality reduction prior to genome-wide association studies

    Abstract. Background: Discovering the genetic basis of common genetic diseases in the human genome represents a public health issue. However, the dimensionality of the genetic data (up to 1 million genetic markers) and its complexity make the statistical analysis a challenging task. Results: We present an accurate modeling of dependences between genetic markers, based on a forest of hierarchical latent class models which is a particular class of probabilistic graphical models. This model offers an adapted framework to deal with the fuzzy nature of linkage disequilibrium blocks. In addition, the data dimensionality can be reduced through the latent variables of the model which synthesize the information borne by genetic markers. In order to tackle the learning of both forest structure and probability distributions, a generic algorithm has been proposed. A first implementation of our algorithm has been shown to be tractable on benchmarks describing 10^5 variables for 2000 individuals. Conclusions: The forest of hierarchical latent class models offers several advantages for genome-wide association studies: accurate modeling of linkage disequilibrium, flexible data dimensionality reduction and biological meaning borne by latent variables.

    A Bayesian network approach to model local dependencies among SNPs

    In this preliminary work, we investigate a method to model linkage disequilibrium among SNPs (single nucleotide polymorphisms) in the genome. Genetic data such as SNPs are characterized by a typical block-like structure along the genome. Graphical models such as Bayesian networks can provide a fine-grained and biologically relevant model of dependencies for both haplotype and genotype SNP data. We applied an MWST-based algorithm (Maximum Weighted Spanning Tree) to construct a Bayesian network, relying on the underlying local dependencies.
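The abstract above does not say how the spanning tree is weighted. As a rough illustration only, the sketch below assumes the classic Chow-Liu construction: weight every SNP pair by its empirical mutual information, then keep a maximum-weight spanning tree using Kruskal's algorithm. The function names and the toy data layout (one list of values per SNP) are hypothetical, and this is not necessarily the authors' exact variant.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum of p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed pairs.
    return sum(
        (c / n) * log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def maximum_weight_spanning_tree(columns):
    """Kruskal's algorithm on the complete graph weighted by mutual information.

    `columns` is a list of equal-length SNP columns (assumed layout).
    Returns the undirected edges (i, j) of the tree.
    """
    edges = sorted(
        ((mutual_information(columns[i], columns[j]), i, j)
         for i, j in combinations(range(len(columns)), 2)),
        reverse=True,  # heaviest edges first
    )
    parent = list(range(len(columns)))  # union-find forest

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so no cycle is created
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Orienting the resulting edges away from an arbitrarily chosen root would turn the undirected tree into a tree-shaped Bayesian network. The all-pairs loop is quadratic in the number of SNPs, so a sketch like this only suits small windows of markers.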

    Visualization of Pairwise and Multilocus Linkage Disequilibrium Structure Using Latent Forests

    The study of linkage disequilibrium is a major topic in statistical genetics, as it plays a fundamental role in gene mapping and helps us learn more about human history. The complex structure of linkage disequilibrium makes exploratory data analysis essential yet challenging. Visualization methods, such as the triangular heat map implemented in Haploview, provide simple and useful tools to help understand complex genetic patterns, but remain insufficient to fully describe them. Probabilistic graphical models have been widely recognized as a powerful formalism allowing a concise and accurate modeling of dependences between variables. In this paper, we propose a method for short-range, long-range and chromosome-wide linkage disequilibrium visualization using forests of hierarchical latent class models. Thanks to its hierarchical nature, our method is shown to provide the geneticist with a compact view of both pairwise and multilocus linkage disequilibrium spatial structures. In addition, a multilocus linkage disequilibrium measure has been designed to evaluate linkage disequilibrium in hierarchy clusters. To learn the proposed model, a new scalable algorithm is presented. It constrains the dependence scope, relying on physical positions, and is able to deal with more than one hundred thousand single nucleotide polymorphisms. The proposed algorithm is fast and does not require phased genotypic data.
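For readers unfamiliar with the pairwise view that a triangular heat map displays, the sketch below computes one common pairwise measure on unphased data: the squared Pearson correlation (r^2) between two SNPs coded as 0/1/2 allele counts. This is an illustrative assumption, not the multilocus measure proposed in the paper, and the function names are hypothetical.

```python
from statistics import mean


def genotype_r2(g1, g2):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts."""
    m1, m2 = mean(g1), mean(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0  # monomorphic SNP: correlation is undefined, reported as 0 here
    return cov * cov / (var1 * var2)


def pairwise_ld(genotypes):
    """r^2 for every SNP pair (i < j): the upper triangle a heat map would colour."""
    m = len(genotypes)
    return {
        (i, j): genotype_r2(genotypes[i], genotypes[j])
        for i in range(m)
        for j in range(i + 1, m)
    }
```

Each entry describes only two markers at a time, which is why pairwise maps cannot by themselves express multilocus structure.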

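    Modélisation pangénomique du déséquilibre de liaison à l'aide de réseaux bayésiens hiérarchiques latents et applications [Genome-wide modeling of linkage disequilibrium using latent hierarchical Bayesian networks, and applications]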

    Recent high-throughput genomic technologies have opened the way for association studies aimed at the genome-wide characterization of genetic factors involved in complex genetic diseases, such as asthma and diabetes. In these studies, linkage disequilibrium (LD) reflects the existence of complex dependences in genetic data and plays a central role, since it enables a precise localization of genetic factors. Nevertheless, the high complexity of LD, as well as the high dimensionality of genetic data, poses serious difficulties. The research carried out during this PhD addresses this context. Its contribution is twofold, both theoretical and applied. On the theoretical side, we proposed a new approach to LD modeling. It is based on a model from artificial intelligence and machine learning, the forest of hierarchical latent class models (FHLCM). The most significant novelties are the ability to take into account the fuzzy nature of LD and to organize the multiple degrees of LD into a hierarchy. A novel scalable learning algorithm, named CFHLC, was developed in two versions: the first requires splitting the genome into contiguous windows to resolve the scalability issue, and the second (CFHLC+), more recent and advanced, uses a sliding window along the chromosome. Using a real dataset, a comparison of the CFHLC method with competing methods showed that it offers a more accurate modeling of LD. In addition, learning on data with varying LD patterns demonstrated the ability of the FHLCM to faithfully reproduce the LD structure. Finally, an empirical analysis of learning complexity showed that running time grows linearly with the number of variables to process. On the applied side, we explored two research avenues: causal discovery, and global and intuitive visualization of LD. On the one hand, a systematic study of the ability of FHLCMs to support causal discovery is illustrated in the context of genetic association. This work laid the foundations for developing novel methods to identify causal genetic factors in genome-wide association studies. On the other hand, a method was developed for the global and intuitive visualization of LD in the three main contexts a geneticist may encounter: visualization of short-range, long-range and genome-wide LD. This new method offers the following advantages: (i) both pairwise LD (two variables) and multilocus LD (more than two variables) are displayed simultaneously, (ii) short-range and long-range LD are easily distinguished, and (iii) information is summarized in a hierarchical manner.

    Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences

    Abstract. Motivation: Genome-wide association studies have systematically identified thousands of single nucleotide polymorphisms (SNPs) associated with complex genetic diseases. However, the majority of those SNPs were found in non-coding genomic regions, preventing the understanding of the underlying causal mechanism. Predicting molecular processes based on the DNA sequence represents a promising approach to understand the role of those non-coding SNPs. Over the past years, deep learning has been successfully applied to regulatory sequence prediction using supervised learning. Supervised learning requires DNA sequences associated with functional data for training, the amount of which is strongly limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is increasing exponentially due to ongoing large sequencing projects, but without functional data in most cases. Results: To alleviate the limitations of supervised learning, we propose a paradigm shift with semi-supervised learning, which exploits not only labeled sequences (e.g. human genome with ChIP-seq experiment), but also unlabeled sequences available in much larger amounts (e.g. from other species without ChIP-seq experiment, such as chimpanzee). Our approach is flexible and can be plugged into any neural architecture including shallow and deep networks, and shows strong predictive performance improvements compared to supervised learning in most cases (up to 70%). Availability and implementation: https://forgemia.inra.fr/raphael.mourad/deepgnn

    Probabilistic graphical models for genetics, genomics, and postgenomics

    Includes bibliographical references and index